This is a complement to the basic statistical analysis that was done on data downloaded from the US Department of Education College Scorecard website.
The US Department of Education provides the option to download the data sets used in the tool from the following website:
https://collegescorecard.ed.gov/data/
The analysis uses the data in the file FieldOfStudyData1516_1617_PP.csv, combining it with institution information to display a financial comparison of programs. The Earnings by Field of Study data shows the median earnings (and debt) of students as of one year after graduation for the classes of 2015-16 and 2016-17.
The data has been pre-processed to keep only the records for which earnings and tuition information is available, leaving 19,563 records with 14 columns.
FieldofStudy_Earnings_for_rep = read.csv("C:/Users/figue/Documents/Collegeboard_data/Field_of_Studies_Earn_rep.csv", stringsAsFactors = FALSE)
dim(FieldofStudy_Earnings_for_rep)
## [1] 19563 14
colnames(FieldofStudy_Earnings_for_rep)
## [1] "INSTNM" "MAIN" "CIPDESC"
## [4] "STABBR" "DEBTMEDIAN" "MD_EARN_WNE"
## [7] "ADM_RATE" "SAT_AVG" "Tuition_Average"
## [10] "Tuition_more_110K" "Debt_payback_YR" "Tuit_Avg_4Y_Pyback"
## [13] "Tuit_Inc_4Y_Pyback" "Salary_10Y_4Y_Tui"
In this analysis, we will review how the words contained in the name of a field of study affect potential earnings. I will also use SAT scores and admission rates as predictors of future earnings. This data will be used to create linear regression models that are discussed later in this document.
The reader might wonder whether the state where the institution is located has something to do with the financial outcome. Before moving forward, let's create a simple linear regression model to test this hypothesis.
lm.Ear.cost = lm(Salary_10Y_4Y_Tui ~ STABBR + ADM_RATE +SAT_AVG , data=FieldofStudy_Earnings_for_rep)
summary(lm.Ear.cost)
##
## Call:
## lm(formula = Salary_10Y_4Y_Tui ~ STABBR + ADM_RATE + SAT_AVG,
## data = FieldofStudy_Earnings_for_rep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -360598 -94744 -34884 71607 911919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 193623.07 64986.16 2.979 0.002892 **
## STABBRAL -69946.69 63268.11 -1.106 0.268934
## STABBRAR -71263.85 63754.06 -1.118 0.263672
## STABBRAZ -6095.70 63843.42 -0.095 0.923936
## STABBRCA -58024.91 62789.80 -0.924 0.355441
## STABBRCO -63852.43 63188.43 -1.011 0.312267
## STABBRCT -56231.28 63624.63 -0.884 0.376819
## STABBRDC -74339.47 64634.13 -1.150 0.250096
## STABBRDE -15518.07 66360.61 -0.234 0.815109
## STABBRFL -75210.13 62928.31 -1.195 0.232038
## STABBRGA -63645.53 63001.20 -1.010 0.312402
## STABBRHI -54241.71 64903.56 -0.836 0.403321
## STABBRIA -42063.28 63229.22 -0.665 0.505900
## STABBRID -35680.55 64027.34 -0.557 0.577351
## STABBRIL -50045.53 62899.62 -0.796 0.426253
## STABBRIN -28347.04 62998.87 -0.450 0.652745
## STABBRKS -26540.53 64013.00 -0.415 0.678432
## STABBRKY -59373.37 63256.57 -0.939 0.347944
## STABBRLA -66690.71 63346.89 -1.053 0.292455
## STABBRMA -53643.40 62992.72 -0.852 0.394460
## STABBRMD -54281.10 63395.55 -0.856 0.391884
## STABBRME -11604.60 65429.24 -0.177 0.859227
## STABBRMI -29412.68 62910.44 -0.468 0.640125
## STABBRMN -47874.19 63032.33 -0.760 0.447554
## STABBRMO -57315.98 63104.08 -0.908 0.363746
## STABBRMS -75282.94 64037.35 -1.176 0.239769
## STABBRMT -83519.67 64917.79 -1.287 0.198272
## STABBRNC -75089.31 62950.81 -1.193 0.232956
## STABBRND 23123.61 64910.38 0.356 0.721666
## STABBRNE -38356.62 63992.91 -0.599 0.548922
## STABBRNH -61505.82 64482.65 -0.954 0.340182
## STABBRNJ -50655.22 63005.88 -0.804 0.421423
## STABBRNM -51641.68 64654.37 -0.799 0.424456
## STABBRNV -28959.93 64816.10 -0.447 0.655025
## STABBRNY -74742.45 62802.73 -1.190 0.234019
## STABBROH -60960.34 62872.78 -0.970 0.332270
## STABBROK -29787.34 63714.38 -0.468 0.640139
## STABBROR -76818.42 63392.61 -1.212 0.225612
## STABBRPA -50970.14 62774.11 -0.812 0.416826
## STABBRPR -251585.61 84853.29 -2.965 0.003032 **
## STABBRRI -56247.71 64140.00 -0.877 0.380526
## STABBRSC -95521.38 63224.69 -1.511 0.130853
## STABBRSD -30873.04 64626.91 -0.478 0.632862
## STABBRTN -72496.08 63129.85 -1.148 0.250836
## STABBRTX -34984.68 62795.81 -0.557 0.577455
## STABBRUT -49569.47 63852.96 -0.776 0.437580
## STABBRVA -63205.35 63029.24 -1.003 0.315976
## STABBRVI -129905.10 153446.52 -0.847 0.397241
## STABBRVT -108198.31 65250.59 -1.658 0.097298 .
## STABBRWA -32416.56 63081.50 -0.514 0.607340
## STABBRWI -38163.20 62996.13 -0.606 0.544655
## STABBRWV -32993.22 63870.87 -0.517 0.605470
## STABBRWY -42190.39 67872.73 -0.622 0.534207
## ADM_RATE -30881.73 8221.75 -3.756 0.000173 ***
## SAT_AVG 162.99 11.74 13.882 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140100 on 15685 degrees of freedom
## (3823 observations deleted due to missingness)
## Multiple R-squared: 0.03911, Adjusted R-squared: 0.0358
## F-statistic: 11.82 on 54 and 15685 DF, p-value: < 2.2e-16
First of all, you can tell that this linear regression model is not very good. The chosen variables explain little of the variance, and the only state/territory that is statistically significant is Puerto Rico (STABBRPR). By themselves, the SAT average and the admission rate are statistically significant, but they do not predict much (R-squared: 0.03911). We are missing the field of study information, which is what matters most.
Since Puerto Rico is relevant, I will add an indicator variable to keep track of this territory, and I will avoid using the feature STABBR, which contains the values for all states.
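As a side note, the R-squared quoted above can also be pulled out of a fitted model programmatically instead of read off the printed summary. A minimal sketch using simulated data (not the College Scorecard file):

```r
# Sketch: extracting R-squared from a fitted lm object (simulated data).
set.seed(1)
toy = data.frame(x = rnorm(100))
toy$y = 2 * toy$x + rnorm(100)
fit = lm(y ~ x, data = toy)
r2     = summary(fit)$r.squared      # multiple R-squared
adj_r2 = summary(fit)$adj.r.squared  # adjusted R-squared
```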
FieldofStudy_Earnings_for_rep$Puerto_Rico = ifelse(FieldofStudy_Earnings_for_rep$STABBR=="PR",1,0)
Let’s continue by splitting the descriptions of the fields of study into individual words. The tm package in R is very helpful for this. There are many different words in the descriptions of the fields of study; it is important to keep the ones that repeat the most, since those are the most relevant.
library(tm)
## Loading required package: NLP
#CIPDESC is the description I would like to analyze
docs = FieldofStudy_Earnings_for_rep$CIPDESC
#VectorSource can create a Corpus from the vector
Field.Corpus = VCorpus(VectorSource(docs))
#Transformations to normalize data, remove common words and punctuation
#Convert to lower case
Field.Corpus = tm_map(Field.Corpus, content_transformer(tolower))
#there are still dots
#remove stop words
Field.Corpus = tm_map(Field.Corpus, removeWords, stopwords("english"))
#punctuation
Field.Corpus = tm_map(Field.Corpus, removePunctuation)
#Document term matrix
Field.matrix = DocumentTermMatrix(Field.Corpus)
# Convert to normal matrix in r
r.field.matrix = as.matrix(Field.matrix)
#Add the description names as row names
rownames(r.field.matrix) = FieldofStudy_Earnings_for_rep$CIPDESC
#find the most common words
count_of_word = apply(r.field.matrix,2,sum)
count_of_word = count_of_word[order(count_of_word,decreasing=TRUE)]
#the most common word is "general", appearing 3,456 times
#there are 380 words in total. How many repeat more than 200 times across degrees?
sum(count_of_word>200)
## [1] 72
# 72. Keep those 72
count_of_word = count_of_word [count_of_word>200]
#now just keep those columns in the r.field.matrix
r.field.matrix.red = r.field.matrix[,names(count_of_word)]
r.field.matrix.red = cbind(r.field.matrix.red, FieldofStudy_Earnings_for_rep[,c("Puerto_Rico","ADM_RATE","SAT_AVG","Salary_10Y_4Y_Tui")])
#change state to Factor
r.field.matrix.red$Puerto_Rico = as.factor(r.field.matrix.red$Puerto_Rico)
colnames(r.field.matrix.red)
## [1] "general" "nursing"
## [3] "management" "services"
## [5] "administration" "studies"
## [7] "engineering" "sciences"
## [9] "arts" "business"
## [11] "operations" "related"
## [13] "health" "education"
## [15] "research" "development"
## [17] "psychology" "clinical"
## [19] "computer" "communication"
## [21] "professional" "specific"
## [23] "teacher" "biology"
## [25] "registered" "science"
## [27] "corrections" "criminal"
## [29] "justice" "media"
## [31] "methods" "accounting"
## [33] "levels" "information"
## [35] "marketing" "language"
## [37] "literature" "english"
## [39] "applied" "physical"
## [41] "social" "educationfitness"
## [43] "humanities" "liberal"
## [45] "government" "political"
## [47] "sociology" "human"
## [49] "fine" "studio"
## [51] "resources" "economics"
## [53] "finance" "financial"
## [55] "professions" "public"
## [57] "history" "mechanical"
## [59] "communications" "work"
## [61] "areas" "subject"
## [63] "allied" "design"
## [65] "electrical" "businesscommerce"
## [67] "medical" "technologiestechnicians"
## [69] "relations" "administrative"
## [71] "systems" "electronics"
## [73] "Puerto_Rico" "ADM_RATE"
## [75] "SAT_AVG" "Salary_10Y_4Y_Tui"
Several of these words are part of a single field of study description, so they will be perfectly correlated. This is not beneficial for a linear regression (and in general for any machine learning algorithm). The following code removes the fields that are perfectly correlated.
# Using apply I can find the counts that are identical across words. By removing duplicates I eliminate perfect correlation
# and remove the columns with duplicated counts, keeping the ones that are unique
colnames(r.field.matrix.red[,c(1:72)])[!duplicated(apply(r.field.matrix.red[,c(1:72)],2,sum))]
## [1] "general" "nursing"
## [3] "management" "services"
## [5] "administration" "studies"
## [7] "engineering" "sciences"
## [9] "arts" "business"
## [11] "operations" "related"
## [13] "health" "education"
## [15] "research" "development"
## [17] "psychology" "clinical"
## [19] "computer" "communication"
## [21] "professional" "specific"
## [23] "biology" "registered"
## [25] "science" "corrections"
## [27] "media" "methods"
## [29] "accounting" "levels"
## [31] "information" "marketing"
## [33] "language" "literature"
## [35] "english" "applied"
## [37] "physical" "social"
## [39] "educationfitness" "humanities"
## [41] "government" "sociology"
## [43] "human" "fine"
## [45] "resources" "economics"
## [47] "finance" "professions"
## [49] "public" "history"
## [51] "mechanical" "communications"
## [53] "work" "areas"
## [55] "allied" "electrical"
## [57] "businesscommerce" "medical"
## [59] "technologiestechnicians" "relations"
## [61] "administrative" "systems"
## [63] "electronics"
r.field.matrix.red_nocor = r.field.matrix.red[,c(colnames(r.field.matrix.red[,c(1:72)])[!duplicated(apply(r.field.matrix.red[,c(1:72)],2,sum))],"Puerto_Rico","ADM_RATE","SAT_AVG","Salary_10Y_4Y_Tui")]
#Do a linear regression
lm.Ear.cost4 = lm(Salary_10Y_4Y_Tui ~ . , data=r.field.matrix.red_nocor)
summary(lm.Ear.cost4)
##
## Call:
## lm(formula = Salary_10Y_4Y_Tui ~ ., data = r.field.matrix.red_nocor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -347454 -43401 -2272 37860 823853
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.719e+05 9.533e+03 18.031 < 2e-16 ***
## general -3.074e+03 3.483e+03 -0.883 0.377407
## nursing 2.180e+05 2.621e+04 8.318 < 2e-16 ***
## management 3.085e+04 5.605e+03 5.503 3.79e-08 ***
## services -2.607e+04 5.539e+03 -4.707 2.54e-06 ***
## administration 4.883e+04 7.816e+03 6.247 4.29e-10 ***
## studies -3.257e+04 3.696e+03 -8.812 < 2e-16 ***
## engineering 2.359e+05 3.315e+03 71.169 < 2e-16 ***
## sciences -1.629e+04 4.075e+03 -3.999 6.40e-05 ***
## arts -1.301e+05 4.619e+03 -28.174 < 2e-16 ***
## business 2.871e+04 6.667e+03 4.306 1.67e-05 ***
## operations -4.024e+04 7.587e+03 -5.304 1.15e-07 ***
## related -8.184e+03 6.180e+03 -1.324 0.185408
## health -1.352e+04 6.043e+03 -2.237 0.025284 *
## education 6.934e+03 6.161e+03 1.125 0.260443
## research -2.427e+04 9.882e+03 -2.456 0.014056 *
## development -6.105e+04 1.374e+04 -4.444 8.89e-06 ***
## psychology -6.558e+04 4.605e+03 -14.242 < 2e-16 ***
## clinical -2.138e+04 9.317e+03 -2.295 0.021746 *
## computer 1.343e+05 5.191e+03 25.876 < 2e-16 ***
## communication -4.372e+04 5.555e+03 -7.870 3.77e-15 ***
## professional 4.690e+04 5.656e+04 0.829 0.407036
## specific 1.107e+04 5.959e+04 0.186 0.852609
## biology -6.847e+04 4.217e+03 -16.236 < 2e-16 ***
## registered -5.906e+05 1.060e+05 -5.573 2.54e-08 ***
## science 1.357e+05 6.516e+03 20.818 < 2e-16 ***
## corrections -1.109e+04 4.369e+03 -2.538 0.011152 *
## media 3.781e+04 6.996e+03 5.404 6.62e-08 ***
## methods 1.432e+05 1.195e+04 11.987 < 2e-16 ***
## accounting 1.663e+05 8.342e+03 19.939 < 2e-16 ***
## levels -1.407e+05 1.342e+04 -10.490 < 2e-16 ***
## information 1.370e+05 6.083e+03 22.524 < 2e-16 ***
## marketing 5.630e+04 4.377e+03 12.862 < 2e-16 ***
## language -2.571e+04 1.300e+04 -1.978 0.047982 *
## literature -5.854e+04 2.281e+04 -2.566 0.010291 *
## english 7.705e+03 2.635e+04 0.292 0.769990
## applied 9.586e+04 6.018e+03 15.929 < 2e-16 ***
## physical -4.414e+04 1.592e+04 -2.773 0.005554 **
## social -3.519e+04 8.799e+03 -4.000 6.38e-05 ***
## educationfitness -1.126e+03 1.787e+04 -0.063 0.949760
## humanities 1.488e+05 8.721e+03 17.064 < 2e-16 ***
## government -1.599e+05 7.644e+03 -20.918 < 2e-16 ***
## sociology -4.857e+04 4.667e+03 -10.407 < 2e-16 ***
## human 6.816e+04 1.043e+04 6.534 6.61e-11 ***
## fine 2.776e+04 6.300e+03 4.406 1.06e-05 ***
## resources -1.664e+04 9.669e+03 -1.721 0.085197 .
## economics 9.341e+04 4.797e+03 19.473 < 2e-16 ***
## finance 1.262e+05 8.786e+03 14.359 < 2e-16 ***
## professions 3.322e+04 1.006e+04 3.300 0.000968 ***
## public -1.257e+04 7.234e+03 -1.738 0.082265 .
## history -5.759e+04 4.959e+03 -11.612 < 2e-16 ***
## mechanical 2.010e+04 5.451e+03 3.687 0.000228 ***
## communications -6.810e+04 1.068e+04 -6.377 1.86e-10 ***
## work 1.618e+04 1.023e+04 1.581 0.113796
## areas NA NA NA NA
## allied 1.562e+05 1.211e+04 12.902 < 2e-16 ***
## electrical -4.625e+04 1.913e+04 -2.418 0.015609 *
## businesscommerce 6.806e+04 7.203e+03 9.449 < 2e-16 ***
## medical -1.168e+05 2.527e+04 -4.622 3.84e-06 ***
## technologiestechnicians 5.536e+04 7.256e+03 7.630 2.48e-14 ***
## relations -3.245e+04 7.702e+03 -4.213 2.53e-05 ***
## administrative 2.046e+05 2.704e+04 7.568 4.00e-14 ***
## systems 4.794e+04 9.947e+03 4.820 1.45e-06 ***
## electronics 1.987e+05 2.304e+04 8.626 < 2e-16 ***
## Puerto_Rico1 -1.868e+05 3.218e+04 -5.806 6.50e-09 ***
## ADM_RATE -3.084e+04 3.923e+03 -7.860 4.08e-15 ***
## SAT_AVG 1.030e+02 6.481e+00 15.889 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78600 on 15674 degrees of freedom
## (3823 observations deleted due to missingness)
## Multiple R-squared: 0.6976, Adjusted R-squared: 0.6964
## F-statistic: 556.3 on 65 and 15674 DF, p-value: < 2.2e-16
#areas seems to be highly correlated; it yields singularities. Remove it.
The word areas seems to be correlated with other variables. There are other words that are highly correlated with others and should be removed. In the next lines of code, I check for the correlated variables and remove a few of them manually. In addition, I will remove words that are not statistically significant. After this is done, a new linear regression is created.
#check correlated variables
#this returns the pairs.
#I need to keep only the top
Corr.m.9 = which(cor(r.field.matrix.red[,1:72])>.9 , arr.ind=TRUE)
#remove diagonal
Corr.m.9=Corr.m.9[!(Corr.m.9[,"row"] == Corr.m.9[,"col"]),]
#bind names that are too correlated
corr_pairs = cbind(rownames(cor(r.field.matrix.red[,1:72]))[Corr.m.9[,"row"]],colnames(cor(r.field.matrix.red[,1:72]))[Corr.m.9[,"col"]])
# Choose which word is more relevant or better describes a degree. The other words will have to be removed
#start with nursing
#for example this is highly correlated to clinical and registered
#those should be removed
words_remove = corr_pairs[corr_pairs[,1]=="nursing",2]
#for keeping track remove from corr_pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#now look for teacher: development, professional, specific
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="teacher",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#justice
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="justice",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#methods
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="methods",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#literature
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="literature",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#educationfitness
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="educationfitness",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#humanities
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="humanities",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#government
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="government",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#fine
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="fine",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#finance
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="finance",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#subject
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="subject",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#electronics
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="electronics",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#medical
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="medical",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
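The repeated remove-and-update steps above could be condensed into a single loop over the words chosen to keep. A sketch of the mechanics using a toy corr_pairs matrix (the word pairs here are made up, chosen only to illustrate the logic):

```r
# Sketch: condensing the repeated removal steps into one loop.
# toy_pairs mimics the structure of corr_pairs (two-column character matrix).
toy_pairs = rbind(c("nursing", "clinical"),
                  c("nursing", "registered"),
                  c("teacher", "development"),
                  c("clinical", "nursing"))
keep_words = c("nursing", "teacher")  # words judged most descriptive
removed = character(0)
for (w in keep_words) {
  removed = c(removed, toy_pairs[toy_pairs[, 1] == w, 2])
  # drop rows whose first word was already marked for removal
  toy_pairs = toy_pairs[!(toy_pairs[, 1] %in% removed), , drop = FALSE]
}
```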
#remove all fields that correspond to words that are highly correlated
r.field.matrix.red_nocor = r.field.matrix.red_nocor[,!(colnames(r.field.matrix.red_nocor) %in% words_remove)]
lm.Ear.cost4 = lm(Salary_10Y_4Y_Tui ~ . , data=r.field.matrix.red_nocor)
#I will remove coefficients whose p-value is greater than .05
lm.Ear.cost4$coefficients[2]
## general
## -942.0453
# retrieve the names of the variables that are significant
names.to.keep = names(summary(lm.Ear.cost4)[["coefficients"]][,"Pr(>|t|)"])[summary(lm.Ear.cost4)[["coefficients"]][,"Pr(>|t|)"]<.05]
#except for the intercept these are column names in r.field.matrix.red_nocor. Remember that we also have to keep Salary_10Y_4Y_Tui
#note that the factor level appears as Puerto_Rico1 in the coefficients, not Puerto_Rico
r.field.matrix.red_nocor.sig = r.field.matrix.red_nocor[,c(colnames(r.field.matrix.red_nocor)[colnames(r.field.matrix.red_nocor) %in% names.to.keep], "Puerto_Rico","Salary_10Y_4Y_Tui")]
lm.Ear.cost5 = lm(Salary_10Y_4Y_Tui ~ . , data=r.field.matrix.red_nocor.sig)
summary(lm.Ear.cost5)
##
## Call:
## lm(formula = Salary_10Y_4Y_Tui ~ ., data = r.field.matrix.red_nocor.sig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -375654 -43579 -2485 38370 841276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164233.13 9369.11 17.529 < 2e-16 ***
## nursing 71078.79 2890.11 24.594 < 2e-16 ***
## management 62841.63 4950.69 12.694 < 2e-16 ***
## services -26759.45 5047.87 -5.301 1.17e-07 ***
## administration 25262.31 7419.83 3.405 0.000664 ***
## studies -30808.34 3419.50 -9.010 < 2e-16 ***
## engineering 237182.94 3179.71 74.593 < 2e-16 ***
## sciences -7043.00 3558.43 -1.979 0.047806 *
## arts -127543.10 4466.01 -28.559 < 2e-16 ***
## business 17548.74 6565.85 2.673 0.007531 **
## operations -34121.99 7516.53 -4.540 5.68e-06 ***
## related -26407.10 5353.69 -4.933 8.20e-07 ***
## health -17694.76 5197.49 -3.404 0.000665 ***
## research -21301.51 9082.41 -2.345 0.019021 *
## psychology -66147.14 3235.20 -20.446 < 2e-16 ***
## computer 136475.00 5181.89 26.337 < 2e-16 ***
## communication -41129.12 5416.27 -7.594 3.29e-14 ***
## biology -68032.71 3259.82 -20.870 < 2e-16 ***
## science 134073.61 6483.52 20.679 < 2e-16 ***
## media 35245.03 6963.05 5.062 4.20e-07 ***
## methods 20757.16 3733.20 5.560 2.74e-08 ***
## accounting 188256.94 7116.89 26.452 < 2e-16 ***
## information 128102.31 5867.10 21.834 < 2e-16 ***
## marketing 59237.12 4300.26 13.775 < 2e-16 ***
## literature -76359.66 4177.27 -18.280 < 2e-16 ***
## applied 92570.79 5738.96 16.130 < 2e-16 ***
## social -39234.95 8738.41 -4.490 7.17e-06 ***
## educationfitness -37997.77 6683.12 -5.686 1.33e-08 ***
## humanities 135249.54 7995.36 16.916 < 2e-16 ***
## government -155544.34 7686.16 -20.237 < 2e-16 ***
## sociology -45597.19 4591.06 -9.932 < 2e-16 ***
## human 40227.98 7547.80 5.330 9.97e-08 ***
## fine 28126.18 6299.20 4.465 8.06e-06 ***
## resources -18599.24 8655.90 -2.149 0.031671 *
## economics 96856.34 4744.96 20.412 < 2e-16 ***
## finance 97771.59 8242.50 11.862 < 2e-16 ***
## professions 49822.10 9089.81 5.481 4.29e-08 ***
## history -54712.65 4896.60 -11.174 < 2e-16 ***
## mechanical 23971.44 5444.75 4.403 1.08e-05 ***
## communications -64916.90 10705.33 -6.064 1.36e-09 ***
## work 23360.97 10128.83 2.306 0.021102 *
## allied 138937.28 10455.71 13.288 < 2e-16 ***
## businesscommerce 68063.08 6358.83 10.704 < 2e-16 ***
## medical 75681.77 10513.81 7.198 6.37e-13 ***
## technologiestechnicians 55010.57 6755.76 8.143 4.15e-16 ***
## relations -37475.40 7157.85 -5.236 1.67e-07 ***
## systems 30747.48 9742.77 3.156 0.001603 **
## electronics 150766.56 12509.14 12.053 < 2e-16 ***
## ADM_RATE -30678.64 3950.21 -7.766 8.58e-15 ***
## SAT_AVG 106.90 6.45 16.575 < 2e-16 ***
## Puerto_Rico1 -189701.84 32405.36 -5.854 4.89e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79220 on 15689 degrees of freedom
## (3823 observations deleted due to missingness)
## Multiple R-squared: 0.6925, Adjusted R-squared: 0.6915
## F-statistic: 706.7 on 50 and 15689 DF, p-value: < 2.2e-16
In this last linear regression all of the features are statistically significant. These features explain close to 70% of the variance in “Net Operating Income” after 10 years (R-squared: 0.6925). A few things can be pointed out: students graduating in Puerto Rico should expect to earn $189,701 less over the 10 years after graduation (if they keep the same first-year salary) compared to students graduating from institutions in one of the 50 US states.
SAT scores and admission rates matter in terms of earnings, probably because they are factors associated with prestigious institutions. An extra point in the average SAT score of the university translates into about US$107 in additional earnings over 10 years. To put it in perspective, a student graduating from an institution with an average SAT score of 1,500 versus an institution with an average SAT score of 1,100 will earn on average about US$40,000 more over 10 years. The admission rate has a similar impact but in the opposite direction: for every drop of one percentage point (0.01) in the institution's admission rate, the expectation would be to earn about US$307 more over 10 years.
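The back-of-the-envelope numbers above can be reproduced directly from the reported coefficients; a minimal sketch, with the coefficient values copied from the summary(lm.Ear.cost5) output:

```r
# Reproducing the effect-size arithmetic from the fitted coefficients.
sat_coef = 106.90     # SAT_AVG estimate from the model summary
adm_coef = -30678.64  # ADM_RATE estimate from the model summary

# 1,500 vs 1,100 average SAT: difference in predicted 10-year earnings
sat_gap = (1500 - 1100) * sat_coef   # about US$42,760

# a one-percentage-point drop in admission rate
adm_gap = -0.01 * adm_coef           # about US$307
```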
Next, I show the density functions for SAT Scores and Admission Rates.
#For the calculation of the density function, I will start with the original data.
#Since the SAT score and admission rate are by institution, this information repeats for the fields of study offered at the
#college/university. It is required to remove duplicates
#create Dataframe with required fields
SAT_ADM_INST = FieldofStudy_Earnings_for_rep[,c("INSTNM","ADM_RATE","SAT_AVG")]
#Eliminate duplicates
SAT_ADM_INST= SAT_ADM_INST[!(duplicated(SAT_ADM_INST)),]
# Remove NAs
SAT_ADM_INST=na.omit(SAT_ADM_INST)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
#Create density chart for SAT first
dens <- density(SAT_ADM_INST$SAT_AVG)
df <- data.frame(x=dens$x, y=dens$y)
#want to find what is the percentile corresponding to an average score of 1500
percentile=ecdf(SAT_ADM_INST$SAT_AVG)
prob_1500 = round(percentile(1500),4)
prob_1500
## [1] 0.9896
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantiles <- quantile(SAT_ADM_INST$SAT_AVG, prob=probs)
df$quant <- factor(findInterval(df$x,quantiles))
quantile_SAT = paste(round(quantiles,0))
label_quants = paste(names(quantiles),quantile_SAT,sep="\n")
ggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) + scale_x_continuous(breaks=quantiles, labels = label_quants ) + scale_fill_brewer(guide="none") + ggtitle("SAT Scores Density") + xlab("% Percentile \n SAT Score") + ylab("Density / Proportion of Institutions")
#Create density for admission rate
dens <- density(SAT_ADM_INST$ADM_RATE)
df <- data.frame(x=dens$x, y=dens$y)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantiles <- quantile(SAT_ADM_INST$ADM_RATE, prob=probs)
df$quant <- factor(findInterval(df$x,quantiles))
quantile_ADM = paste(round(quantiles,2))
label_quants = paste(names(quantiles),quantile_ADM,sep="\n")
ggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) + scale_x_continuous(breaks=quantiles, labels = label_quants ) + scale_fill_brewer(guide="none") + ggtitle("Admission Rate Density") + xlab("% Percentile \n Admission Rate") + ylab("Density / Proportion of Institutions")
The density chart for the SAT scores shows that an average SAT score of 1,123 is the median across all institutions in the US. In short, with this score about 50% of the institutions in the US would accept the student. Only 5% of the institutions in the US report an average SAT score of 1,398 or more. I did a simple calculation of the equivalent percentile for an SAT score of 1,500: it is very close to 99%. In short, only 1% of the institutions in the US report an average SAT score of 1,500 or more (the maximum possible SAT score is 1,600).
The density for the admission rate shows a long tail on the left of the chart. The 5th percentile is 28%; more surprising is that institutions at the 25th percentile accept 54% of applicants. There are several factors taken into consideration during the admission process, but it looks like students have a good chance of getting accepted into the institution of their choice, probably excluding the ones with very high SAT score requirements.
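The percentile lookup used above (ecdf) and its inverse (quantile) can be illustrated on simulated scores; a minimal sketch that does not use the Scorecard data:

```r
# Sketch: ecdf() maps a score to its percentile; quantile() goes the other way.
set.seed(42)
scores = round(rnorm(1000, mean = 1120, sd = 130))  # simulated institution SAT averages
pct    = ecdf(scores)          # empirical cumulative distribution function
p_1500 = pct(1500)             # fraction of institutions at or below 1500
med    = quantile(scores, 0.5) # median score
```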
The next block of code creates a diverging bar chart using the words that were taken into consideration as factors in the linear regression calculated earlier.
library(scales)
#This contains all of the estimated coefficients
summary(lm.Ear.cost5)[["coefficients"]][,"Estimate"]
## (Intercept) nursing management
## 164233.1332 71078.7898 62841.6339
## services administration studies
## -26759.4495 25262.3119 -30808.3364
## engineering sciences arts
## 237182.9435 -7042.9977 -127543.1000
## business operations related
## 17548.7412 -34121.9930 -26407.1000
## health research psychology
## -17694.7623 -21301.5110 -66147.1432
## computer communication biology
## 136474.9971 -41129.1186 -68032.7097
## science media methods
## 134073.6082 35245.0269 20757.1586
## accounting information marketing
## 188256.9406 128102.3070 59237.1243
## literature applied social
## -76359.6616 92570.7886 -39234.9531
## educationfitness humanities government
## -37997.7705 135249.5384 -155544.3415
## sociology human fine
## -45597.1951 40227.9828 28126.1844
## resources economics finance
## -18599.2418 96856.3412 97771.5857
## professions history mechanical
## 49822.0960 -54712.6508 23971.4394
## communications work allied
## -64916.8963 23360.9666 138937.2813
## businesscommerce medical technologiestechnicians
## 68063.0828 75681.7737 55010.5707
## relations systems electronics
## -37475.3956 30747.4824 150766.5606
## ADM_RATE SAT_AVG Puerto_Rico1
## -30678.6437 106.9021 -189701.8429
#will keep only the features for words in fields of study
Word_Fields = summary(lm.Ear.cost5)[["coefficients"]][,"Estimate"][2:48]
# Negative or Positive contribution
Contribution = ifelse(Word_Fields>0, "Positive", "Negative")
#bind
Word_Fields=as.data.frame(cbind(Word_Fields,Contribution), stringsAsFactors=FALSE)
Word_Fields$Word = rownames(Word_Fields)
#update column Names
colnames(Word_Fields)=c("US_Dollars","Contribution_Direction","Word")
Word_Fields$US_Dollars = as.numeric(Word_Fields$US_Dollars)
#round US dollars
#the sorting is corrected below
Word_Fields$US_Dollars = round(Word_Fields$US_Dollars,0)
#change words and Contribution to factors
Word_Fields$Word = as.factor(Word_Fields$Word)
Word_Fields$Contribution_Direction = as.factor(Word_Fields$Contribution_Direction)
#sort by dollars
Word_Fields=Word_Fields[order(Word_Fields$US_Dollars,decreasing = FALSE),]
#this is necessary to convert Word to a factor and for it to be sorted correctly
Word_Fields$Word <- factor(Word_Fields$Word, levels = Word_Fields$Word)
Word_Fields$Contribution_Direction <- factor(Word_Fields$Contribution_Direction, levels(Word_Fields$Contribution_Direction)[c(2,1)])
p = ggplot(Word_Fields, aes(x=Word, y=US_Dollars, label=Word)) +
geom_bar(stat='identity', aes(fill=Contribution_Direction), width=.5) +
scale_fill_manual(name="Contribution Direction",
labels = c("Positive", "Negative"),
values = c("Positive"="#00ba38", "Negative"="red")) + scale_y_continuous(labels=dollar_format()) +
labs(subtitle="Positive or Negative Contribution by Word in Bachelor Degree Field of Study",
title= "Diverging Bar Plot - US Dollar Amounts") +
coord_flip()
p
The previous chart provides a quick way to identify the words in fields of study that yield a better return over 10 years. For example, combining engineering with computer or electronics predicts a better financial outcome by a large margin. On the negative side of the chart, there are a few surprises such as biology. It is likely that students in this type of career delay their higher incomes because they pursue medicine or veterinary programs after their undergraduate degrees. As for the word government, which seems to be the one that affects the results most negatively, it is combined with political science in the name of the field of study, so the net result is around zero.
Even though this is probably a good guideline for making a decision, it should be combined with information about the range of incomes (variance) for a particular field of study. If the variance is large, then it becomes more important to research the best schools for that program to maximize the potential income after graduation.
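One way to quantify that spread is to compute a dispersion measure per field. A sketch on a toy data frame shaped like the Scorecard extract (the field names and dollar values here are made up for illustration):

```r
# Sketch: per-field spread of the 10-year earnings measure.
# toy_df mimics CIPDESC / Salary_10Y_4Y_Tui; values are illustrative only.
toy_df = data.frame(
  CIPDESC = rep(c("Economics.", "Finance."), each = 4),
  Salary_10Y_4Y_Tui = c(150000, 420000,  90000, 610000,   # wide spread
                        210000, 230000, 200000, 220000)   # tight cluster
)
spread = aggregate(Salary_10Y_4Y_Tui ~ CIPDESC, data = toy_df, FUN = sd)
# the field with the larger standard deviation rewards school research more
```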
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#I will focus on Finance and Financial Management Services, and Economics
Eco_Business = FieldofStudy_Earnings_for_rep[FieldofStudy_Earnings_for_rep$CIPDESC=="Finance and Financial Management Services." | FieldofStudy_Earnings_for_rep$CIPDESC=="Economics.",c("CIPDESC","ADM_RATE","SAT_AVG","Salary_10Y_4Y_Tui")]
# shorten "Finance and Financial Management Services." to Finance and "Economics." to Economics
Eco_Business$CIPDESC = ifelse(Eco_Business$CIPDESC=="Economics.","Economics","Finance")
#change to factor
Eco_Business$CIPDESC = as.factor(Eco_Business$CIPDESC)
#omit any NAs
Eco_Business=na.omit(Eco_Business)
fig <- plot_ly(Eco_Business, x = ~ADM_RATE, y = ~SAT_AVG, z = ~Salary_10Y_4Y_Tui, color = ~CIPDESC, colors = c('#BF382A', '#0C4B8E'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Admission Rate'),
yaxis = list(title = 'SAT Avg Score'),
zaxis = list(title = 'Net Income over 10Y')))
fig
The scatter plot shows that there is more variance for a degree in Economics (in red). By choosing the right institution(s), a student is likely to maximize the level of income after graduation. As for the degree in Finance, even though there is an outlier, the expected income for the majority of the programs is clustered. In short, there is more value in looking for good institutions offering Economics majors than Finance majors. For Finance majors there are only 3 institutions with top earnings after graduation, whereas for Economics there are about 20.
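A rough way to arrive at counts like those is to apply an earnings threshold per field. A sketch on toy data (the cutoff and dollar values are hypothetical, not taken from the Scorecard file):

```r
# Sketch: counting programs above an earnings threshold, per field.
toy = data.frame(
  CIPDESC = c(rep("Economics", 5), rep("Finance", 5)),
  Salary_10Y_4Y_Tui = c(700000, 650000, 620000, 300000, 250000,
                        680000, 310000, 290000, 270000, 260000)
)
cutoff = 600000  # hypothetical "top earnings" threshold
top_counts = table(toy$CIPDESC[toy$Salary_10Y_4Y_Tui > cutoff])
```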
The following are takeaways from this and the previous statistical analysis.
The most significant predictor of earnings after graduating from college is the field of study chosen, not the institution from which the student graduates.
Puerto Rico is the only territory/state where graduates should expect to earn less after graduation, statistically speaking. To address this situation, students in this territory should plan to move to one of the 50 states after graduation to increase their earning potential substantially.
SAT scores and institution admission rates predict less than 4% of the variance in earnings.
Once a field of study is chosen, it makes sense to look for specific institutions when there is a high degree of variance in the expected operating income. For example, students pursuing careers in Computer Engineering, Computer Science, Economics, and Nursing should look into the top institutions in terms of potential operating income.
For degrees that yield the lowest levels of earnings after graduation, it is of the utmost importance for students to attend institutions with the lowest possible costs. Student loans should be avoided as much as possible when pursuing these degrees.